GPU Accelerated Self-join for the Distance Similarity Metric

نویسندگان

  • Michael Gowanlock
  • Ben Karsin
چکیده

The self-join finds all objects in a dataset within a threshold of each other defined by a similarity metric. As such, the self-join is a building block for the field of databases and data mining, and is employed in Big Data applications. In this paper, we advance a GPU-efficient algorithm for the similarity self-join that uses the Euclidean distance metric. The search-and-refine strategy is an efficient approach for low dimensionality datasets, as index searches degrade with increasing dimension (i.e., the curse of dimensionality). Thus, we target the low dimensionality problem, and compare our GPU self-join to a search-and-refine implementation, and a state-of-the-art parallel algorithm. In low dimensionality, there are several unique challenges associated with efficiently solving the self-join problem on the GPU. Low dimensional data often results in higher data densities, causing a significant number of distance calculations and a large result set. As dimensionality increases, index searches become increasingly exhaustive, forming a performance bottleneck. We advance several techniques to overcome these challenges using the GPU. The techniques we propose include a GPU-efficient index that employs a bounded search, a batching scheme to accommodate large result set sizes, and a reduction in distance calculations through duplicate search removal. Our GPU self-join outperforms both search-and-refine and state-of-the-art algorithms.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Similarity Join in Metric Spaces Using eD-Index

Similarity join in distance spaces constrained by the metric postulates is the necessary complement of more famous similarity range and the nearest neighbor search primitives. However, the quadratic computational complexity of similarity joins prevents from applications on large data collections. We present the eD-Index, an extension of D-index, and we study an application of the eDIndex to imp...

متن کامل

Heads-Join: Efficient Earth Mover's Distance Similarity Joins on Hadoop

The Earth Mover’s Distance (EMD) similarity join has a number of important applications such as near duplicate image retrieval and distributed based pattern analysis. However, the computational cost of EMD is super cubic and consequently the EMD similarity join operation is prohibitive for datasets of even medium size. We propose to employ the Hadoop platform to speed up the operation. Simply p...

متن کامل

Solving Multiple Queries through a Permutation Index in GPU

Query-by-content by means of similarity search is a fundamental operation for applications that deal with multimedia data. For this kind of query it is meaningless to look for elements exactly equal to the one given as query. Instead, we need to measure dissimilarity between the query object and each database object. The metric space model is a paradigm that allows modeling all similarity searc...

متن کامل

Accelerating Neuroimage Registration through Parallel Computation of Similarity Metric

Neuroimage registration is crucial for brain morphometric analysis and treatment efficacy evaluation. However, existing advanced registration algorithms such as FLIRT and ANTs are not efficient enough for clinical use. In this paper, a GPU implementation of FLIRT with the correlation ratio (CR) as the similarity metric and a GPU accelerated correlation coefficient (CC) calculation for the symme...

متن کامل

Access Structures for Advanced Similarity Search in Metric Spaces

Similarity retrieval is an important paradigm for searching in environments where exact match has little meaning. Moreover, in order to enlarge the set of data types for which the similarity search can efficiently be performed, the notion of mathematical metric space provides a useful abstraction for similarity. In this paper we consider the problem of organizing and searching large data-sets f...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2018